fix: correct judge prompt construction in MM-MT-Bench by abdelhadi703 · Pull Request #14 · mistralai/mistral-evals

abdelhadi703 · 2026-03-03T12:16:31Z

Summary

Fix message content extraction in judge prompt construction (iterating over dict keys instead of values)
Fix reference answer extraction (passing full dict instead of text content)
Add image support in judge prompts (convert PIL images to base64 for the judge model)
Fix _add_or_append_chunk() to actually append image chunks to the prompt
Fix operator precedence in health check condition (models.py)

Fixes #8

Changes

eval/tasks/mm_mt_bench.py: Extract .content from message dicts, extract text from reference answer chunks, convert PIL images to base64 image_url chunks, fix image chunk appending
eval/models.py: Add parentheses to fix operator precedence in _wait_till_healthy()

Test plan

Run MM-MT-Bench evaluation and verify judge prompts are correctly formatted
Verify health check logic works with both empty body and JSON status responses

🤖 Generated with Claude Code

- Extract message content from dicts in get_judgement() instead of passing full message dicts to the judge prompt builder - Add _extract_text_content() to extract plain text from reference answers that may be lists of typed chunks - Add _convert_image_chunks() to convert PIL image objects to base64 image_url chunks for the judge model - Fix _add_or_append_chunk() to actually append image chunks to the prompt instead of returning them - Fix operator precedence in _wait_till_healthy() health check condition Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

abdelhadi703 · 2026-03-30T01:15:58Z

Hi @mistralai/team,

Bumping this fix for MM-MT-Bench judge prompt construction. It ensures correct prompt formatting for evaluation.

Happy to address any review comments. Thanks!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: correct judge prompt construction in MM-MT-Bench#14

fix: correct judge prompt construction in MM-MT-Bench#14
abdelhadi703 wants to merge 1 commit into
mistralai:mainfrom
abdelhadi703:fix/mm-mt-bench-judge-prompt

abdelhadi703 commented Mar 3, 2026

Uh oh!

abdelhadi703 commented Mar 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

abdelhadi703 commented Mar 3, 2026

Summary

Changes

Test plan

Uh oh!

abdelhadi703 commented Mar 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant